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Statement  of  the  problem  studied 


1.1  Introduction 

Modern  sensors  are  collecting  very  high-dimensional  data  at  unprecedented  volume  and  speed,  often 
from  platforms  with  limited  processing  power.  These  large  datasets  allow  scientists  and  analysts 
to  consider  richer  physical  models  with  larger  numbers  of  variables,  and  thereby  have  the  potential 
to  provide  new  insights  into  the  underlying  complex  phenomena.  Social  media  generate  troves  of 
data,  but  human  analysts  and  machines  cannot  quickly  and  accurately  identify  inappropriate  or 
dangerous  content.  The  Large  Hadron  Collider  (LHC)  at  CERN  “generates  so  much  data  that 
scientists  must  discard  the  overwhelming  majority  of  it  ~  hoping  hard  they’ve  not  thrown  away 
anything  useful.”  [23]  Typical  NASA  missions  collect  hundreds  of  terabytes  of  data  every  hour  [18]: 
the  Solar  Data  Observatory  generates  1.5  terabytes  of  data  daily  [36],  and  the  upcoming  Square 
Kilometer  Array  (SKA,  [19])  is  projected  to  generate  an  exabyte  of  data  daily,  “more  than  twice 
the  information  sent  around  the  internet  on  a  daily  basis  and  100  times  more  information  than 
the  LHC  produces”  [35].  In  these  and  a  variety  of  other  science  and  engineering  settings,  there 
is  a  pressing  need  to  recover  relevant  or  anomalous  information  accurately  and  efficiently  from  a 
high- dimensional,  high-velocity  data  stream. 

Rigorous  analysis  of  such  data  poses  major  issues,  however.  First,  we  are  faced  with  the  noto¬ 
rious  “curse  of  dimensionality” ,  which  states  that  the  number  of  observations  required  for  accurate 
inference  in  a  stationary  environment  grows  exponentially  with  the  dimensionality  of  each  observa¬ 
tion.  This  requirement  is  often  unsatisfied  even  in  so-called  “big  data”  settings,  as  the  underlying 
environment  varies  over  time  in  many  applications.  Furthermore,  any  viable  method  for  processing 
massive  data  must  be  able  to  scale  well  to  high  data  dimensions  with  limited  memory  and  computa¬ 
tional  resources.  Finally,  in  a  variety  of  large-scale  streaming  data  problems,  ranging  from  motion 
imagery  formation  to  network  analysis,  the  underlying  environment  is  dynamic  yet  predictable, 
but  many  general-purpose  and  computationally  efficient  methods  for  processing  streaming  data 
lack  a  principaled  mechanism  for  incorporating  dynamical  models.  Thus  a  fundamental  mathe¬ 
matical  and  statistical  challenge  is  accurate  and  efficient  tracking  of  dynamic  environments  with 
high-dimensional  streaming  data. 

Classical  stochastic  gradient  descent  methods,  including  the  least  mean  squares  (LMS)  or  recur¬ 
sive  least  squares  (RLS)  algorithms  do  not  have  a  natural  mechanism  for  incorporating  dynamics. 
Classical  stochastic  filtering  methods  such  as  Kalman  or  particle  filters  or  Bayesian  updates  [6] 
readily  exploit  dynamical  models  for  effective  prediction  and  tracking  performance.  However,  these 
methods  are  also  limited  in  their  applicability  because  (a)  they  typically  assume  an  accurate,  fully 
known  dynamical  model  and  (b)  they  rely  on  strong  assumptions  regarding  a  generative  model 
of  the  observations.  Some  techniques  have  been  proposed  to  learn  the  dynamics  [58,  53],  but  the 
underlying  model  still  places  heavy  restrictions  on  the  nature  of  the  data.  Performance  analysis 
of  these  methods  usually  does  not  address  the  impact  of  “model  mismatch” ,  where  the  generative 
models  are  incorrectly  specified. 

A  contrasting  class  of  prediction  methods,  receiving  widespread  recent  attention  within  the 
machine  learning  community,  is  based  on  an  “individual  sequence”  or  “universal  prediction”  [38] 
perspective;  these  strive  to  perform  provably  well  on  any  individual  observation  sequence  without 
assuming  a  generative  model  of  the  data.  Online  convex  programming  provides  a  variety  of  tools  for 
sequential  universal  prediction  [40,  8,  60,  17].  Here,  a  Forecaster  measures  its  predictive  performance 
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according  to  a  convex  loss  function,  and  with  each  new  observation  it  computes  the  negative  gradient 
of  the  loss  and  shifts  its  prediction  in  that  direction.  Stochastic  gradient  descent  methods  stem 
from  similar  principles  and  have  been  studied  for  decades,  but  recent  technical  breakthroughs  allow 
these  approaches  to  be  understood  without  strong  stochastic  assumptions  on  the  data,  even  in 
adversarial  settings,  leading  to  more  efficient  and  rapidly  converging  algorithms  in  many  settings. 

This  paper  describes  a  novel  framework  for  prediction  in  the  individual  sequence  setting  which 
incorporates  dynamical  models  -  effectively  a  novel  combination  of  state  updating  from  stochas¬ 
tic  hlter  theory  and  online  convex  optimization  from  universal  prediction.  We  establish  tracking 
regret  bounds  for  our  proposed  algorithm.  Dynamic  Mirror  Descent  (DMD),  which  characterize 
how  well  we  perform  relative  to  some  alternative  approach  [e.g.,  a  computationally  intractable 
batch  algorithm)  operating  on  the  same  data  to  generate  its  own  predictions,  called  a  “comparator 
sequence.”  Our  novel  regret  bounds  scale  with  the  deviation  of  this  comparator  sequence  from  a 
dynamical  model.  These  bounds  simplify  to  previously  shown  bounds  when  there  are  no  dynamics. 
In  addition,  we  describe  methods  based  on  DMD  for  adapting  to  the  best  dynamical  model  from 
either  a  finite  or  parametric  class  of  candidate  models.  In  these  settings,  we  establish  tracking 
regret  bounds  which  scale  with  the  deviation  of  a  comparator  sequence  from  the  best  sequence  of 
dynamical  models. 

While  our  methods  and  theory  apply  in  a  broad  range  of  settings,  we  are  particularly  interested 
in  the  setting  where  the  dimensionality  of  the  parameter  to  be  estimated  is  very  high.  In  this 
regime,  the  incorporation  of  both  dynamical  models  and  sparsity  regularization  plays  a  key  role. 
With  this  in  mind,  we  focus  on  a  class  of  methods  which  incorporate  regularization  as  well  as 
dynamical  modeling.  The  role  of  regularization,  particularly  sparsity  regularization,  is  increasingly 
well  understood  in  batch  settings  and  has  resulted  in  significant  gains  in  ill-posed  and  data-starved 
settings  [7,  45,  14,  9].  More  recent  work  has  examined  the  role  of  sparsity  in  online  methods  such 
as  recursive  least  squares  (RLS)  algorithms,  but  do  not  account  for  dynamic  environments  [3]. 

1.1.1  Organization  of  paper  and  main  contributions 

The  remainder  of  this  paper  is  structured  as  follows.  In  Section  1.2,  we  formulate  the  problem 
and  introduce  notation  used  throughout  the  paper,  and  Section  1.3  provides  some  background 
definitions  for  online  convex  optimization.  Our  Dynamic  Mirror  Descent  (DMD)  method,  along 
with  tracking  regret  bounds  are  presented  in  Section  2.1;  this  section  also  describes  the  application 
of  data-dependent  dynamical  models  and  their  connection  to  recent  work  on  online  learning  with 
predictable  sequences.  DMD  uses  only  a  single  series  of  dynamical  models,  but  we  can  use  it  to 
choose  among  a  family  of  candidate  dynamical  models.  This  is  described  for  finite  families  in 
Section  2.2  using  a  fixed  share  algorithm,  and  for  parametric  families  in  Section  2.3.  Section  2.4 
shows  experimental  results  of  our  methods  in  a  variety  of  contexts  ranging  from  imaging  to  self¬ 
exciting  point  processes.  Finally,  Section  2.5  makes  concluding  remarks  while  proofs  are  relegated 
to  Section  ??. 


1.2  Problem  formulation 

The  problem  of  sequential  prediction  is  posed  as  an  iterative  game  between  a  Forecaster  and  the 
Environment.  At  every  time  point,  t,  the  Forecaster  generates  a  prediction  Ot  from  a  closed,  convex 
set  0  C  M'^.  After  the  Forecaster  makes  a  prediction,  the  Environment  reveals  the  loss  function 
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iti')  where  £t  is  a  convex  function  which  maps  the  space  0  to  the  real  number  line.  We  will  assume 
that  the  loss  function  is  the  composition  of  a  convex  function  /t  :  0  — )•  M  from  the  Environment 
and  a  convex  regularization  function  r  :  0  — >•  M  which  does  not  change  over  time.  Frequently 
the  loss  function,  ft  will  measure  the  accuracy  of  a  prediction  compared  to  some  new  data  point 
xt  £  y.  where  X  is  the  domain  of  possible  observations.  The  regularization  function  promotes  low¬ 
dimensional  structure  (such  as  sparsity)  within  the  predictions.  We  additionally  assume  that  we 
can  compute  a  subgradient  of  it  or  ft  at  any  point  0  £  Q,  which  we  denote  Vit  and  V ft-  Thus  the 
Forecaster  incurs  the  loss  it{0t)  =  ft{0t)  +  'i"{0t)- 

The  goal  of  the  Forecaster  is  to  create  a  sequence  of  predictions  9i,92,  ■  ■  ■  ,0t  that  has  a  low 
cumulative  loss  Ylt  ^t{9t).  Because  the  loss  functions  are  being  revealed  sequentially,  the  prediction 
at  each  time  can  only  be  a  function  of  all  previously  revealed  losses  to  ensure  causality.  Thus,  the 
task  facing  the  Forecaster  is  to  create  a  new  prediction,  9t+i,  based  on  the  previous  prediction  and 
the  new  loss  function  it^),  with  the  goal  of  minimizing  loss  at  the  next  time  step.  We  characterize 
the  efficacy  of  6t={9i,  6*2, ... ,  9t)  £  0^  relative  to  a  comparator  sequence  6t—{9i,  02,  ■  ■  ■ ,  9t)  £ 
0^  using  a  concept  called  regret,  which  measures  the  difference  of  the  total  accumulated  loss  of 
the  Forecaster  with  the  total  accumulated  loss  of  the  comparator.  We  are  particularly  interested 
in  comparators  which  are  computationally  intractable  batch  algorithms;  in  a  sense,  then,  regret 
encapsulates  how  much  one  regrets  working  in  an  online  setting  as  opposed  to  a  batch  setting  with 
full  knowledge  of  past  and  future  observations; 

Definition  1  (Regret).  The  regret  of  9t  with  respect  to  a  comparator  6t  £  0^  is 

T  T 

RT{eT)=Y,it(9t)-Y,it{9t). 

t=l  t=l 


Our  goal  is  to  develop  an  online  convex  optimization  algorithm  with  low  (sublinear  in  T)  regret 
relative  to  a  broad  family  of  comparator  sequences.  Previous  work  proposed  algorithms  which 
yielded  regret  of  0(\/T)  for  the  relatively  small  family  of  static  comparators,  where  9t  =  9  for  some 
9  £  Q  and  all  t.  In  contrast,  our  main  result  is  an  algorithm  which  incorporates  a  dynamical  model, 
denoted  :  0  1— )■  0,  and  admits  a  tracking  regret  bound  of  the  form  0{^/T[1  +  “ 

1.3  Online  convex  optimization  preliminaries 

One  common  approach  to  forming  the  predictions  9t,  Mirror  Descent  (MD)  [40,  8],  consists  of 
solving  the  following  optimization  problem: 

9t+i  =  arg  min rjtCViti^t),  9)  +D{9\\0t),  (1.1) 

eee 

where  Vit{9)  denotes  an  arbitrary  subgradient  of  it  at  9,  D{9\\9t)  is  the  Bregman  divergence  [12,  15] 
between  9  and  9,  and  rjt  >  0  is  a  step  size  parameter.  Let  ?/)  :  0  — )•  M  denote  a  continuously 
differentiable  function  that  is  cr-strongly  convex  for  some  parameter  cj  >  0  and  some  norm  ||  •  ||: 

fj{9i)>  f^{92)  +  {Vi^{92),  01  -  92) +  ^\\9i- 02  f  (1.2) 
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The  Bregman  divergence  associated  with  ijj  is 


D{9^\\e2)H{0i)  -  HO2)  -  (VV'(02),01  -  92) 

>^\\ei-92f  (1.3) 

An  important  consequence  of  this  definition  is  the  following  generalization  of  the  law  of  cosines: 
for  all  01 , 02 )  ^3  £  0 


Zl(0i||02)  =I1(03||02)  + 11(01 1|03) 

+  (Vl/.(02)-VV'(03),03  -  0l).  (1.4) 

The  MD  approach  is  a  generalization  of  online  learning  algorithms  such  as  online  gradient 
descent  [60]  and  weighted  majority  [33].  Several  recently  proposed  methods  consider  the  data- 
fit  term  separately  from  the  regularization  term  [22,  57,  31].  For  instance,  consider  Composite 
Objective  Mirror  Descent  (COMID)  [22],  where: 

9t+i  =argminr/i(V/t(0t),0)  +  r?tr(0)  +Zl(0110i).  (1.5) 

eee 

This  formulation  is  helpful  when  the  regularization  function  r(0)  promotes  sparsity  in  0,  and  helps 
ensure  that  the  individual  0*  are  indeed  sparse,  rather  than  approximately  sparse  as  are  the  solutions 
to  the  MD  formulation. 

1.3.1  Static  regret 

In  much  of  the  online  learning  literature,  the  comparator  sequence  is  constrained  to  be  static  or 
time- invariant.  In  this  paper  we  refer  to  the  regret  with  respect  to  a  static  comparator  as  static 
regret: 

Definition  2  (Static  regret).  The  static  regret  of  6t  is 

T  T 

RT{eT)=Y.m)-mmY.^t{9). 

t=i  t=i 

Static  regret  bounds  are  useful  in  characterizing  how  well  an  online  algorithm  performs  relative 
to,  say,  a  loss-minimizing  batch  algorithm  with  access  to  all  the  data  simultaneously.  More  generally, 
static  regret  bounds  compare  the  performance  of  the  algorithm  against  a  static  point  which  can  be 
chosen  with  full  knowledge  of  the  data. 

1.3.2  Tracking  regret 

Static  regret  fails  to  illuminate  the  performance  of  online  algorithms  in  dynamic  settings  where 
the  underlying  parameters  may  be  changing  in  time.  Performance  relative  to  a  temporally- varying 
or  dynamic  comparator  sequence  has  been  studied  previously  in  the  literature  in  the  context  of 
tracking  regret  (also  known  as  shifting  regret)  [28,  16],  and  the  closely-related  concept  of  adaptive 
regret  [33,  26]. 

In  particular,  tracking  regret  compares  the  output  of  the  online  algorithm  to  a  sequence  of 
points  01,02, ...  ,0r  which  can  be  chosen  collectively  with  full  knowledge  of  the  data.  This  is  a 
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fair  comparison  for  a  batch  algorithm  that  detects  and  fits  to  drift  in  the  data,  instead  of  fitting 
a  single  point.  Frequently,  in  order  to  bound  tracking  regret  there  needs  to  be  a  measure  of  the 
complexity  of  the  sequence  61,62, ...  ,0t,  characterized  via  a  measure  of  the  temporal  variability  of 
the  sequence,  such  as 

T-l 

n0T)= 

t=i 

If  this  complexity  is  allowed  to  be  very  high,  we  could  imagine  that  the  comparator  series  would 
fit  the  datastream  and  associated  series  of  losses  closely  and  hence  generalize  poorly.  Conversely, 
if  this  complexity  is  restricted  to  be  0,  the  tracking  regret  is  equivalent  to  static  regret.  Generally, 
sublinear  tracking  regret  is  only  possible  when  the  comparator  sequence  6t  is  piecewise  constant 
(where  6t+i  —  6t  =  0  for  all  but  a  few  t)  or  varying  quite  slowly  over  time  -  that  is,  for  a  small 
family  of  comparators. 
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Summary  of  the  most  important  results 


2.1  Dynamical  models  in  online  convex  programming 

In  contrast  to  previous  tracking  regret  bounds,  we  develop  methods  and  tracking  regret  bounds 
which  scale  with  W^t+i  —  where  t  =  1,  2, ...  is  a  sequence  of  dynamical  models, 

yielding  small  regret  bounds  for  much  broader  classes  of  dynamic  comparator  sequences.  Specifi¬ 
cally,  we  propose  the  alternative  to  (1.1)  and  (1.5)  in  Algorithm  1,  which  we  call  Dynamic  Mirror 
Descent  (DMD).  By  including  in  the  process,  we  effectively  search  for  a  predictor  which  (a) 


Algorithm  1  Dynamic  mirror  descent  (DMD)  with  known  dynamics 

Given  decreasing  sequence  of  step  sizes  rjt  >  0 
Initialize  G  0. 
for  t  =  1, ...  ,T  do 

Observe  xt  and  incur  loss  it  (St) 

Receive  dynamical  model 
Set 


Ot+i  =aTcginmr]t{Vft(9t),0)  +Vtr{9)  +  D{e\\9t)  (2.1a) 

6»ee 

9t+i  =  MOt+i)  (2.1b) 


end  for 


attempts  to  minimize  the  loss  and  (b)  which  adheres  to  the  dynamical  model  <I>t.  This  is  similar 
to  a  stochastic  filter  which  alternates  between  using  a  dynamical  model  to  update  the  “state”, 
and  then  uses  this  state  to  perform  the  hltering  action.  A  key  distinction  of  our  approach  and 
analysis,  however,  is  that  we  make  no  assumptions  about  <I>t’s  relationship  to  the  observed  data. 
Our  approach  effectively  includes  dynamics  into  the  COMID  approach.^  Indeed,  for  a  case  with  no 
dynamics,  so  that  ^t{9)  =  9  for  all  9  and  t,  our  method  is  equivalent  to  COMID. 

Our  main  result  uses  the  following  assumptions: 


•  For  all  t  =  1, . . . ,  T  the  functions  it  and  il)  are  Lipschitz  with  constants  G  and  M  respectively, 
such  that  ||V£i(0)||*  <  G  and  ||V’(^)||*  <  M  for  all  9  £  Q.  The  function  ||  •  ||*  used  in  these 
assumptions  is  the  dual  to  the  norm  in  (1.2). 

•  There  exists  a  constant  Dmax  such  that  D{9i\\92)  <  Dmax  for  all  0i,02  G  O- 

•  For  all  t  =  1, . . .  ,T,  the  transformation  <I>t  has  a  maximum  distortion  factor  A$  such  that 
D(<hf(0i)||<ht(02))  —  D{9i\\92)  <  A$  for  all  9i,92  G  0.  When  A$  <  0  for  all  t,  we  say  that  ‘hi 
satisfies  the  contractive  property. 

^Rather  than  considering  COMID,  we  might  have  used  other  online  optimization  algorithms,  such  as  the  Regular¬ 
ized  Dual  Averaging  (RDA)  method  [57],  which  has  been  shown  to  achieve  similar  performance  with  more  regularized 
solutions.  However,  to  the  best  of  our  knowledge,  no  tracking  or  shifting  regret  bounds  have  been  derived  for  dual 
averaging  methods  (regularized  or  otherwise).  Recent  results  on  the  equivalence  of  COMID  and  RDA  [37]  suggest 
that  the  bounds  derived  here  might  also  hold  for  a  variant  of  RDA,  but  proving  this  remains  an  open  problem. 
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2.1.1  Tracking  regret  bound 


Theorem  3  (Tracking  regret  of  dynamic  mirror  descent).  Let  be  a  dynamical  model  such  that 
<  0  for  t  =  with  respect  to  the  Bregman  used  in  2.1.  Let  the  sequence  6t  he 

generated  using  Alg.  1  using  a  non-increasing  series  pt+i  <  rjt,  with  a  convex,  Lip schitz  function  it 
on  a  closed,  convex  set  Q,  and  let  6t  he  an  arbitrary  sequence  in  0^.  Then 


T>  IQ  \  -C^max  ,  2M  (a  \  ^  ST^ 

Rt[Ot)  < - 1 - V$(0t)  +  /  Nt 

rjT+i  VT  2fT  ^ 


t=l 


T-1 


with  Vi(0r)=  ^  I|6't+1  - 


t=i 


where  V^{6t)  measures  variations  or  deviations  of  the  comparator  sequence  6t  from  the  sequence 
of  dynamical  models  $2,  •  •  ■  j  ^T-  If  Vt  ^  then  for  some  C  >  0  independent  ofT, 

RiOr)  <  cVT(l  +  i4(0r)) 


This  bound  scales  with  the  comparator  sequence’s  deviation  from  the  sequence  of  dynamical 
models  -  a  stark  contrast  to  previous  tracking  regret  bounds  which  are  only  sublinear  for 

comparators  which  change  slowly  with  time  or  at  a  small  number  of  distinct  time  instances.  Note 
that  when  corresponds  to  an  identity  operator,  the  bound  in  Theorem  3  corresponds  to  existing 
tracking  or  shifting  regret  bounds  [17,  16]. 

It  is  intuitively  satisfying  that  this  measure  of  variation,  appears  in  the  tracking  regret 

bound.  First,  if  the  comparator  actually  follows  the  dynamics,  this  variation  term  will  be  very 
small,  leading  to  low  tracking  regret.  This  fact  holds  whether  ^t  is  part  of  the  generative  model 
for  the  observations  or  not.  Second,  we  can  get  a  dynamic  analog  of  static  regret,  where  we 
enforce  =  0.  This  is  equivalent  to  saying  that  the  batch  comparator  is  fitting  the  best 

single  trajectory  using  instead  of  the  best  single  point.  Using  this,  we  would  recover  a  bound 
analogous  to  a  static  regret  bound  in  a  stationary  setting. 

The  condition  that  A$  <  0  is  similar  to  requiring  that  be  a  contractive  mapping.  This 
restriction  is  important;  without  it,  any  poor  prediction  made  at  one  time  step  could  be  exacerbated 
by  repeated  application  of  the  dynamics.  For  instance,  linear  dynamic  models  with  all  eigenvalues 
less  than  or  equal  to  unity  satisfy  this  condition  with  respect  to  the  squared  £2  Bregman  Divergence, 
similar  in  spirit  to  restrictions  made  in  more  classical  adaptive  filtering  work  such  as  [25]. 

Notice  also  that  if  ^t{d)  =  9  in  for  all  t,  then  Theorem  3  gives  a  novel,  previously  unknown 
tracking  regret  bound  for  COMID. 


2.1.2  Data-dependent  dynamics 

An  interesting  example  of  dynamical  models  is  the  class  of  data-dependent  dynamical  models.  In 
this  regime  the  state  of  the  system  at  a  given  time  is  not  only  a  function  of  the  previous  state, 
but  also  the  actual  observations.  One  key  example  of  this  scenario  arises  in  self-exciting  point 
processes,  where  the  state  of  the  system  is  directly  related  to  the  previous,  stochastic  observations. 
Our  algorithm  can  account  for  such  models  since  the  function  <I>t(0)  is  time  varying,  and  therefore 
can  implicitly  depend  on  all  data  up  to  time  t,  i.e.  ^t{9)  =  ^t{9,  xi,  X2, . .  • ,  xt).  Our  regret  bounds 
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therefore  scale  with  how  well  the  comparator  series  matches  these  data  dependent  dynamics: 


R{Ot)  <  C 


T-l 

E 

t=i 


1 + pt+i  -  . . .  ,xi)ii 


Notice  now  that  the  data  plays  a  part  in  the  regret  bounds,  whereas  before  we  only  measured 
the  variation  of  the  comparator.  Data-dependent  regret  bounds  are  not  new.  Concurrent  related 
work  considers  online  algorithms  where  the  data  sequence  is  described  by  a  “predictable  process” 
[43].  The  basic  idea  of  that  paper  is  that  if  one  has  a  sequence  functions  Mt  which  predict  xt 
based  on  xi,  X2,  •  • . ,  xt_i,  then  the  output  of  a  standard  online  optimization  routine  should  be 
combined  with  the  predictor  generated  by  Mt  to  yield  tighter  regret  bounds  that  scale  with  \\xt— 
Mt{xi, . . . ,  However,  [43]  only  works  with  static  regret  (i.e.,  regret  with  respect  to  a 

static  comparator)  and  their  regret  has  a  variation  term  that  expresses  the  deviation  of  the  input 
data  from  the  underlying  process.  In  contrast,  our  tracking  regret  bounds  scale  with  the  deviation 
of  a  comparator  sequence  from  a  prediction  model. 


2.2  Prediction  with  a  finite  family  of  dynamical  models 

DMD  in  the  previous  section  uses  a  single  sequence  of  dynamical  models.  In  practice,  however, 
we  may  not  know  the  best  dynamical  model  to  use,  or  the  best  model  may  change  over  time 
in  nonstationary  environments.  To  address  this  challenge,  we  assume  a  finite  set  of  candidate 
dynamical  models  $2,4,  •  •  • ,  ^N,t}  at  every  time  t,  and  describe  a  procedure  which  uses  this 

collection  to  adapt  to  nonstationarities  in  the  environment.  In  particular  we  establish  tracking 
regret  bounds  which  scale  not  with  the  deviation  of  a  comparator  from  a  single  dynamical  model, 
but  with  how  it  deviates  from  a  series  of  different  dynamical  models  on  different  time  intervals 
with  at  most  m  switches.  These  switches  define  m  +  1  different  time  segments  [ti,ti+i  —  1]  with 
time  points  1  =  <  •  •  •  <  tm+2  =  T.  We  can  bound  the  regret  associated  with  the  best  dynamical 

model  on  each  time  segment  and  then  bound  the  overall  regret  using  a  Prediction  with  Experts 
Advice  algorithm 

Our  dynamic  fixed  share  (DFS)  estimate  is  presented  in  Algorithm  2.  Let  6i^t  denote  the  output 
of  Alg.  1  using  dynamical  models  ^1^2,  ■  ■  ■  ■,  we  choose  Ot  by  using  the  Fixed  Share  forecaster 
on  these  outputs.^  In  Fixed  Share,  each  expert  (here,  each  sequence  of  candidate  dynamical  models) 
is  assigned  a  weight  that  is  inversely  proportional  to  its  cumulative  loss  at  that  point  yet  with  some 
weight  shared  amongst  all  the  experts,  so  that  an  expert  with  very  small  weight  can  quickly  regain 
weight  to  become  the  leader  [27,  16]. 

In  this  update,  A  G  (0, 1)  is  a  parameter  which  controls  how  much  of  the  weight  is  shared 
amongst  the  experts.  By  sharing  some  weight,  it  allows  experts  with  high  loss,  and  therefore  low 

^There  are  many  algorithms  from  the  Prediction  with  Expert  Advice  literature  which  can  be  used  to  form  a  single 
prediction  from  the  predictions  created  by  the  set  of  dynamical  models.  We  use  the  Fixed  Share  algorithm  [27]  as  a 
means  to  combine  estimates  with  different  dynamics;  however,  other  methods  could  be  used  with  various  tradeoffs. 
One  of  the  primary  drawbacks  of  the  Fixed  Share  algorithm  is  that  an  upper  bound  on  the  number  of  switches 
m  must  be  known  a  priori.  However,  this  method  has  a  simple  implementation  and  tracking  regret  bounds.  One 
common  alternative  to  Fixed  Share  allows  the  switching  parameter  (A  in  Alg.  2)  to  decrease  to  zero  as  the  algorithm 
runs  [30,  Ij.  This  has  the  benefit  of  not  requiring  knowledge  about  the  number  of  switches,  but  comes  at  the  price  of 
higher  regret.  Alternative  expert  advice  algorithms  exist  which  decrease  the  regret  but  increase  the  computational 
complexity.  For  a  thorough  treatment  of  existing  methods  see  [24]. 
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Algorithm  2  Dynamic  fixed  share  (DFS) 

Given  decreasing  sequence  of  step  sizes  rjt  >  0  and  rjr  >  0 

Initialize  9i  G  0,  Oi^i  G  0  and  Wi^i  =  for  i  =  1, . . . ,  A,  A  G  (0, 1),  and  rjt,  r]r  >  0. 
for  t  =  1, ...  ,T  do 

Observe  xt  and  incur  loss  itidt) 

Receive  dynamical  model  for  i  =  1, ...  ,N 

for  i  =  1, . . . ,  A  do 

Set 

exp  (-r]r£t 

Y^Wj^texp  (^-r]rit 
i=i 

Wi^t+i  =  jY  + 

9i^t+i  =  aig min r]t{'V ft(9i^t),  9)  +'ntr{9)  +  D{9\\9i^t) 
e&e 

=  ^i,ti9i^t+l) 

end  for 

Set 

N 

9t+i  =  Wi^t+i9i,t+i 

i=l 

end  for 
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weight,  to  quickly  regain  weight  if  they  start  performing  well.  This  is  the  mechanism  that  allows 
fast  switching  between  experts. 

Theorem  4  (Tracking  regret  of  DFS  algorithm).  Assume  all  the  candidate  dynamic  sequences  are 
contractive  such  that  A$  <  0  for  for  all  t  =  1,  ..,T  and  i  =  1,  ...,N  with  respect  to  the  Bregman 
Divergence  in  alg  1.  Then  for  some  C  >  0,  the  dynamie  fixed  share  algorithm  in  Algorithm  2 

with  parameter  A  set  equal  to  rjr  =  iog(AO+miog(T)+i)~  ^  convex, 

Lipschitz  function  It  on  a  closed,  eonvex  set  Q,  has  tracking  regret 

T  T 

Rt{0t)  =  Y.It{l) 

t=i  t=i 

<  C  (Vf  (^V(m  +  1)  logiV  +  mlogT  +  y(™+i)(6tT)))  , 


where 

m+l  ifc+i  — 1 

l/("*+^)(0r)=  min  ^  W^t+i  - 

k=l 

measures  the  deviation  of  the  sequence  9t  from  the  best  sequence  of  dynamical  models  with  at  most 
m  switches  (where  m  does  not  depend  onT). 


Note  that  the  family  of  comparator  sequences  Ot  for  which  Rt{6t)  scales  sublinearly  in  T  is 
significantly  larger  than  the  set  of  comparators  yielding  sublinear  regret  for  MD. 

If  T  is  not  known  in  advance  the  doubling  trick  [17]  can  be  used,  where  temporary  time  horizons 
are  set  of  increasing  length.  Note  that  <  V$.  ^(0^)  for  any  fixed  i  G  {1, . . . ,  N},  thus 

this  approach  yields  a  lower  variation  term  than  using  a  fixed  dynamical  model.  However,  we  incur 
some  loss  by  not  knowing  the  optimal  number  of  switches  m  or  when  the  optimal  switching  times 
are. 


2.3  Parametric  dynamical  models 

Rather  than  having  a  finite  family  of  dynamical  models,  as  we  did  in  Section  2.2,  we  may  consider 
a  parametric  family  of  dynamical  models,  where  the  parameter  a.  G  M”  of  is  allowed  to  vary 
across  a  closed,  convex  domain,  denoted  A.  In  other  words,  we  consider  :  0  x  Al  i— )•  0.  In  this 
context  we  would  like  to  jointly  predict  both  a  and  9.  One  might  consider  defining  (—{O',  a)  as  the 
concatenation  of  9  and  a,  and  then  generating  a  sequence  of  Ct’s  using  COMID  (1.5)  or  DMD  (2.1). 
However,  the  COMID  regret  would  not  capture  deviations  of  a  comparator  sequence  of  predictions 
from  a  series  of  dynamical  models  as  desired.  To  use  DMD,  we  would  need  to  define  a  dynamical 
model  'krOxAli— )>0xAl,  so  that  I>{9,a)  =  {^{9, a), a),  and  use  this  in  place  of  in  (2.1). 
However,  T  is  not  contractive  for  most  of  interest,  so  the  DMD  regret  bounds  would  not  hold. 

To  address  these  challenges,  we  consider  two  approaches.  First,  in  Section  2.3.1  we  consider 
tracking  only  a  finite  subset  of  the  possible  model  parameters,  in  a  manner  similar  to  when  we  had 
a  finite  collection  of  possible  dynamical  models,  which  provide  a  “covering”  of  the  parameter  space. 
In  this  case,  the  overall  regret  and  computational  complexity  both  depend  on  the  resolution  of  the 
covering  set.  Second,  in  Section  2.3.2,  we  consider  a  special  family  of  additive  dynamical  models; 
in  this  setting,  we  can  efficiently  learn  the  optimal  dynamics. 
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2.3.1  Covering  the  set  of  dynamical  models 


In  this  section  we  show  that  by  tracking  a  subset  which  appropriately  covers  the  entire  space  of 
candidate  models,  we  can  bound  the  overall  regret,  as  well  as  bound  the  number  of  parameter 
values  we  have  to  track,  and  the  inherent  tradeoff  between  the  two.  We  propose  to  choose  a  finite 
collection  of  parameters  from  a  closed,  convex  set  A  and  perform  DFS  (Alg.  2)  on  this  collection. 
We  specifically  consider  the  case  where  the  true  dynamical  model  a*  G  ^  is  unchanging  in  time  and 
use  DFS  with  m  =  0.  (Fixed  share  with  m  =  0  amounts  to  the  Exponentially  Weighted  Averaging 
Forecaster  [55,  33,  17].)  In  the  below,  for  any  a  G  yl,  let 

T-l 

Fc,(0T,a)= 

Theorem  5  (Covering  sets  of  dynamics  parameter  space).  LetSN  >  0  and  An  denote  a  covering  set 
for  A  with  cardinality  N ,  such  that  for  every  a  ^  A,  there  is  some  a'  ^  An  such  that  jja  — a'jj  <  £n- 
Define  candidate  dynamical  models  as  <h4(-,  a)  for  a  G  An  and  assume  they  are  all  contractive  with 
respect  to  the  Bregman  Divergence  used  in  Alg.  1.  If  ||l>t(0,a)  —  <I>t(0,/3)||  <  Ljja  —  /3||  for  some 
L  >  0  for  all  a,  j3  ^  A,  then  for  some  constant  C  >  0,  the  DFS  algorithm  with  rjt  =  = 

/  2  iog( iv)  ^  ^  _  Q  yields  d  tracking  regret  bounded  by 


c  {  Vt 


y/log{N)  +  min  I4(0t,  a)  +  Ten 
a£A 


Intuitively,  we  know  that  if  we  set  sat  to  be  very  small  we  will  have  good  performance  because 
any  possible  parameter  value  a  G  .4,  would  have  to  be  close  to  a  candidate  dynamic;  however,  we 
would  need  to  choose  many  candidates.  Conversely,  if  we  run  DFS  on  only  a  few  candidate  models, 
it  will  be  computationally  much  more  efficient  but  our  total  regret  will  grow  due  to  parameter 
mismatch. 


Corollary  6.  Assume  A  C  [Amw,  ,  and  let  'y  >  0  be  given.  Let  k  =  [(Amax  —  Amin)nT'^/2] 

and  d  =  (Amax  Amin)/ (fik')  j  let  An  —  {Amm  “k  d,  Amin  “k  39, . . . ,  Amin(2Ai  1)9}  correspond  to 
an  n-dimensional  grid  with  k^  grid  points  over  A.  Then 

max  min  jja  —  a^jji  <  T”"*'. 

OL^Jk. 

Additionally,  the  total  number  of  grid  points  is  upper  bounded  by 

N  <  ^(^max  -  Amin)?^-^'^  ^  0{T'^'^) 

Under  the  assumptions  of  Theorem  5,  with  this  set  An  and  using  the  fact  that  norms  are  equivalent 
on  finite- dimensional  vectors  /i.e.,  there’s  a  finite  Z  >  0  such  that  jja  —  /3||i  <  Zjja  —  /3||  for  any 
a,/?  G  A  for  any  norm),  we  get  the  following  bound  on  regret  for  some  constant  C  >  0. 


Rt{0t)  <  c  i^Vf 


■J yn log(T)  +  min  V^(6t,  a) 


_|_  rpl-'yn 


Here  we  have  an  explicit  tradeoff  between  regret  and  computationally  accuracy  controlled  by  7, 
since  the  computational  computational  complexity  is  linear  in  =  0(r'’'”).  We  can  further  control 
the  tradeoff  between  computation  complexity  and  performance  by  allowing  en  to  vary  in  time.  This 
could  be  done  by  using  the  doubling  trick,  setting  temporary  time  horizons,  and  then  refining  the 
grid  once  the  temporary  time  horizon  is  reached  using  a  slightly  different  experts  algorithm  which 
could  account  for  the  changing  number  of  experts  as  in  [47]. 
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2.3.2  Additive  dynamics  in  exponential  families 

The  approach  described  above  for  generating  a  covering  set  of  dynamical  models  may  be  effective 
when  the  dimension  of  parameters  is  small;  however,  in  higher  dimensions,  this  approach  can 
require  significant  computational  resources.  In  this  section,  we  consider  an  alternative  approach 
that  only  requires  the  computation  of  predictions  for  a  single  dynamical  model.  We  will  see  that 
in  some  settings,  the  prediction  produced  by  Dynamic  Mirror  Descent  (DMD)  and  a  certain  set 
of  parameters  for  the  dynamic  model  can  quickly  be  converted  to  the  prediction  for  a  different  set 
of  parameters.  While  the  method  described  in  this  section  is  efficient  and  admits  strong  regrets 
bounds,  it  is  applicable  only  for  loss  functions  derived  from  exponential  families. 

The  basics  of  exponential  families  are  described  in  [2,  56],  and  mirror  descent  in  this  setting  is 
explored  in  [5,  42].  We  assume  some  (^  :  X  — )>  which  is  a  measurable  function  of  the  data,  and 
let  (j)k,  k  =  1,2, . . . ,  d,  denote  its  components: 

(t){x)  =  {(t)i{x),...,(j)d{x))^  . 

We  use  the  specific  loss  function 

lt{0)  = -\ogpe{xt)  (2.2a) 

where 

pe{x)=  expK^,  (t){x))  -  Z{e)]  (2.2b) 

for  a  sufficient  statistic  (j)  and  Z{6)=log  f  eyip{{6,(j){x))}dx,  known  as  the  log-partition  function, 
ensures  that  peix)  integrates  to  a  constant  independent  of  9.  Furthermore,  as  in  [5,  42],  we  use 
the  Bregman  divergence  corresponding  to  the  Kullback-Liebler  divergence  between  two  members 
of  the  exponential  family: 


D{e,\\92)  =  Z{ei)  -  Z(02)  -  (VZ(02), e,  -  62). 


In  our  analysis  we  will  be  using  the  Legendre-Fenchel  dual  of  Z  [29,  11]: 


Z*{p)^snp{{p,e)-Z{9)}. 

6»ee 


Let  0*  denote  the  image  of  Int  0  nnder  the  gradient  mapping  VZ  i.e.  0*  =  VZ(Int  0).  An 
important  fact  is  that  the  gradient  mappings  VZ  and  VZ*  are  inverses  of  one  another  [8,  17,  39]: 


YZ*{VZ{e))  =  9  1 
YZ{VZ*{p))=p  j 


G  Int  0,  /i  G  Int  0* 


Following  [17],  we  may  refer  to  the  points  in  Int  0  as  the  primal  points  and  to  their  images  under  VZ 
as  the  dual  points.  For  simplicity  of  notation,  in  the  sequel  we  will  write  p  =  VZ{9),  9  =  VZ*{p), 
fit  =  VZ(9t),  etc. 

Additionally,  we  will  use  a  dynamical  model  that  takes  on  a  specific  form: 


M9,  a)  =  VZ*{AtVZ{9)  +  Bta  +  a)  (2.3) 

for  9  G  Int  &,ct  G  a  G  MT,  At  G  and  Bt  G  At,Bt  and  Ct  are  considered  known. 

Using  these  dynamics,  we  let  9a, t  denote  the  output  of  DMD  (Alg.  1)  at  time  t  and  fia,t  be  its  dual. 
Under  all  these  conditions,  we  have  the  following  Lemma. 
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Lemma  7.  For  any  a,  jl  G  A,  let  the  duals  of  the  initial  prediction  for  DMD  and 

Ki  =  0  G  Additionally  assume  that  the  minimizer  of  equation  2.1a  is  a  point  in  Int  0  for 

any  parameter  a  G  A.  Then  the  DMD  prediction  under  a  dynamical  model  parameterized  by  a  can 
be  calculated  directly  from  the  DMD  prediction  under  a  dynamical  model  parameterized  by  jS  for 
t  >  0  as 

ha,t  =  fii3,t  +  Kt{a  -  P) 

where 

Kt  =  {1  —  rit-i)At-iKt-i  +  Bt-i. 


From  Lemma  7,  we  see  that  the  prediction  for  dynamical  model  a  can  be  computed  simply 
from  the  prediction  using  parameters  P  and  the  value  Kt-  This  is  a  significant  computational  gain 
compared  to  DFS,  where  we  had  to  keep  track  of  predictions  for  each  candidate  dynamical  model 
individually  and  therefore  needed  to  bound  the  number  of  experts  for  tractability. 

Algorithm  3  leverages  Lemma  7  to  simultaneously  track  both  9t  and  the  best  dynamical  model 
parameter  a.  In  this  algorithm,  is  the  function  defined  as 

UiT)HpYZ\p))  =  it{B). 

The  basic  idea  is  the  following:  we  use  mirror  descent  to  compute  an  estimate  of  the  best  dynamical 
model  parameter,  compute  the  DMD  prediction  associated  with  that  parameter,  and  then  use  DMD 
to  update  that  prediction  for  the  next  round. 


Algorithm  3  Dynamic  mirror  descent  (DMD)  with  parametric  additive  dynamics 

Given  decreasing  sequence  of  step  sizes  pt,r]t  >  0 
Initialize  3i  =  0,  iFi  =  0,  6i  G  0,  pi  =  VZ{9i) 

for  t  =  1, ...,  T  do 

Observe  xt 

Incur  loss  it^t)  =  -(Ot,  P^xt))  +  Z{9t) 

Set  gt{a)  = 

Set  at+i  =  proj_4  (at  -  Pt^ gt(ott)) 

Set  =pt  +  Kt{at+i  -  at) 

Set  Jk+i  =  (1  -  Vt)Tt+i  +  VtPixt) 

Set  9j+i  =  VZp(pt+i) 

Set  9t+i  =  ^t{Gt+i,  oit+i) 

Set  Kt+i  =  (1  -  m)AtKt  +  Bt 
end  for 


Theorem  8.  Assume  that  the  observation  space  X  is  bounded.  Let  0  C  &e  a  bounded,  convex 
set  satisfying  the  following  properties  for  a  given  constant  H  >  O.' 

•  For  all  9  G  Q, 

Z(9)=  f  exp  {(0,  (/>(x))}dz^(x)  < +00. 

Jx 

•  For  all  9Ge,  V^Z{9)  h  2HIdy<d- 

•  Let  ft  denote  the  objective  function  in  (2.1a).  For  every  x  G  X  and  t  G  {1,2,3,...},  the 

solution  toargmin/t(0)  occurs  where  Vft  =  0. 
e&e 
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If  the  assumptions  of  Lemma  7  hold,  and  is  contractive  for  all  a  ^  A  with  respect  to  the 

Bregman  Divergence  induced  by  Z{6),  and  loss  function  is  of  the  form  (2.2)  and 


is  convex  in  pL,  and  rjt,  Pt  oc  If  Vi,  then  the  tracking  regret  associated  with  Algorithm  3  for  dynamical 
models  of  the  form  (2.3)  is 


for  some  constant  C  >  0. 

Theorem  8  shows  that  Algorithm  3  allows  us  to  simultaneous  track  predictions  and  dynamics, 
and  we  perform  nearly  as  well  as  if  we  knew  the  best  dynamical  models  for  the  entire  sequence 
in  hindsight.  While  this  approach  is  only  applicable  for  specific  forms  of  the  loss  functions  and 
dynamical  models,  those  forms  arise  in  a  wide  variety  of  practical  problems. 

2.4  Experiments  and  results 

As  mentioned  in  the  introduction,  many  online  learning  problems  can  benefit  from  the  incorporation 
of  dynamical  models.  In  the  below,  we  describe  how  the  ideas  described  and  analyzed  in  this 
paper  might  be  applied  to  anomaly  detection  from  streaming  dynamic  textures,  compressive  video 
reconstruction,  and  analysis  of  neuron  firing  rates  within  networks. 

2.4.1  DMD  experiment:  dynamic  textures  with  missing  data 

As  mentioned  in  the  introduction,  sensors  such  as  the  Solar  Data  Observatory  are  generating  data 
at  unprecedented  rates.  Heliophysicists  have  physical  models  of  solar  dynamics,  and  often  wish 
to  identify  portions  of  the  incoming  data  which  are  inconsistent  with  their  models.  This  “data 
thinning”  process  is  an  essential  element  of  many  big  data  analysis  problems.  We  simulate  an 
analogus  situation  in  this  section. 

In  particular,  we  consider  a  datastream  corresponding  to  a  dynamic  texture  [52]  [20]  ,  where 
spatio-temporal  dynamics  within  motion  imagery  are  modeled  using  an  autoregressive  process.  In 
this  experiment,  we  consider  a  setting  where  “normal”  autoregressive  parameters  are  known,  and 
we  use  these  within  DMD  to  track  a  scene  from  noisy  image  sequences  with  missing  elements. 
(Missing  elements  arise  in  our  motivating  solar  astronomy  application,  for  instance,  when  cosmic 
rays  interfere  with  the  imaging  detector  array.)  As  suggested  by  our  theory,  the  tracking  will  fail 
and  generate  very  large  losses  when  the  posited  dynamical  model  is  inaccurate. 

More  specifically,  the  idea  of  dynamic  textures  is  that  a  low  dimensional,  auto-regressive  model 
can  be  used  to  simulate  a  video  which  replicates  a  moving  texture  such  as  flowing  water,  swirling 
fog,  solar  plasma  flows,  or  rising  smoke.  This  process  is  modeled  in  the  following  way: 


6t  —  A9t-i  +  But 
Xt  =  Cq  T  COt  Dvt 


vt,ut  ~  W(0,/). 
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Noisy  Data 


Noisy  Data,  50%  Missing 


(a)  Noisy  realization  of  a  synthetic  dynamic  water  tex¬ 
ture 


(b)  Noisy  realization  of  a  synthetic  dynamic  water  tex¬ 
ture  with  50%  missing  pixels 


Figure  2.1:  Dynamic  textures  simulation  setup 


In  the  above,  6t  denotes  the  true  underlying  parameters  of  the  system,  and  xt  the  observations. 
The  matrix  A  is  the  autoregressive  parameters  of  the  system,  which  will  be  unique  for  the  type 
of  texture  desired,  Cq  the  average  background  intensity,  C  is  the  sensing  matrix  which  is  usually 
a  tall  matrix,  and  B  and  D  encode  the  strength  of  the  driving  and  observation  noises  respec¬ 
tively.  Using  the  toolbox  developed  in  [44]  and  samples  of  a  220  by  320  pixel  ocean  scene  [20], 
we  learned  two  sets  of  parameters  A,  A'  G  one  representing  the  water  flowing  when  the 

data  is  played  forward,  and  the  other  when  played  backwards,  as  well  as  corresponding  parameters 
Co  G  M^O'^oo,C,C'  G  G  and  D,D'  G  Parameters  (9*  G  [-500,  500]®° 

and  data  xt  G  [—500,500]™^®*^  were  then  generated  using  these  parameters,  with  the  parameters 
A' ,  B' ,  C',  D'  and  Cq  on  t  =  100, ...,  120  and  t  =  300, ...,  320  and  the  parameters  A^  B,  C,  D,  Cq  on 
the  rest  of  t  =  1,  ...,550  according  to  the  above  equations.  Finally,  every  observation  is  corrupted 
by  50%  missing  values,  chosen  uniformly  at  random  at  every  time  point.  Examples  of  the  full  noisy 
data,  and  data  with  missing  values  are  shown  in  Figure  2.1. 

The  parameters  A,  Cq,  and  C  were  then  used  to  define  our  (imperfect)  dynamical  model  for 
DMD,  =  A6,  and  a  loss  function  it{0)  =  \\Pt{C6  —  Cq  —  xt)]]!,  where  Pt  is  a  linear  operator 

accounting  for  the  missing  data.  Note  that  B  and  D  are  not  reflected  in  these  choices  despite 
playing  a  role  in  generating  the  data;  our  theoretical  results  hold  regardless.  We  use  ipi')  =  5II  ‘  II2 
so  the  Bregman  Divergence  D{x\\y)  =  —  i/jjl  is  the  usual  squared  Euclidean  distance,  and  we 

perform  no  regularization  {r{9)  =  0).  We  set  rjt  =  and  ran  100  different  trials  comparing  the 
DMD  method  to  regular  Mirror  Descent  (MD)  to  see  the  advantage  of  accounting  for  underlying 
dynamics.  The  results  are  shown  in  Eigure  2.2. 

There  are  a  few  important  observations  about  this  procedure.  The  hrst  is  that  by  incorporating 
the  dynamic  model,  we  produce  an  estimate  which  visually  looks  like  the  dynamic  texture  of  interest, 
instead  of  the  Mirror  Descent  prediction,  which  looks  like  a  single  snapshot  of  the  water.  Second, 
we  can  recover  a  good  representation  of  the  scene  with  a  large  amount  of  missing  data,  due  to  the 
autoregressive  parameters  being  of  a  much  lower  dimension  than  the  data  itself.  Einally,  because 
we  are  using  the  dynamics  of  forward  moving  water,  when  the  true  data  starts  moving  backward, 
a  change  that  is  imperceptible  visually,  the  loss  spikes,  alerting  us  of  the  abnormal  behavior. 
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Loss  vs  Time  of  Single  Trial 


j  Mean  of  Loss  vs  Time  Over  1 00  T riais 


(a)  Loss  curves  for  proposed  dynamic  mirror  de-  (b)  Loss  curves  for  DMD  and  MD  against  time 
scent  (DMD)  method  and  mirror  descent  (MD)  over  100  trials, 
against  time  for  a  single  trial. 


Figure  2.2:  Simulation  results  for  the  experiment  in  Section  2.4.1.  The  vertical  dashed  lines  indicate 
the  intervals  where  the  posited  dynamical  model  was  not  reflected  by  the  underlying  data;  note 
the  sharp  increases  in  the  losses  associated  with  DMD  over  those  intervals,  particularly  in  contrast 
with  the  losses  associated  with  MD.  Standard  online  learning  methods  like  MD  do  not  facilitate 
the  detection  of  subsets  of  data  which  do  not  fit  hypothesized  physical  models. 


2.4.2  DFS  experiment:  social  network  analysis 


To  demonstrate  the  performance  of  Dynamic  Mirror  Descent  (DMD)  combined  with  the  Fixed  share 
algorithm  (which  we  call  Dynamic  Fixed  Share  (DFS)),  we  consider  two  scenarios:  reconstruction 
of  a  dynamic  scene  {i.e.,  video)  from  sequential  compressed  sensing  observations,  and  tracking 
connections  in  a  dynamic  social  network. 

Dynamical  models  have  a  rich  history  in  the  context  of  social  network  analysis  [49],  but  we  are 
unaware  of  their  application  in  the  context  of  online  learning  algorithms.  To  show  how  DMD  can 
bridge  this  gap,  we  track  the  influence  matrix  of  seats  in  the  US  Senate  from  1795  to  2011  using 
roll  call  data  (http:/ /www. voteview.com/dwnl.htm).  At  time  t,  we  observe  the  “yea”  or  “nay”  vote 
of  each  Senator,  which  we  represent  with  a  +1  or  —1.  When  a  Senator’s  vote  is  unavailable  (for 
instance,  before  a  state  joined  the  union),  we  use  a  0.  We  form  a  length  p  =  100  vector  of  these 
votes  indexed  by  the  Senate  seat,  and  denote  this  xt- 

Following  [45],  we  form  a  loss  function  using  a  negative  log  Ising  model  pseudolikelihood  to 
sidestep  challenging  issues  associated  with  the  partition  function  of  the  Ising  model  likelihood. 

For  a  social  network  with  p  agents,  9t  G  [—1, 1]^*^^,  where  {Ot)ab  corresponds  to  the  correlation 
in  voting  patterns  between  agents  a  and  h  at  time  t.  Let  V  denote  the  set  of  agents,  V\a  the  set  of 
all  agents  except  a,  Xa  the  vote  of  agent  a,  and  0a—{0ah  ■  b  G  V}.  Our  loss  function  is 


7’t“^(^a)-log  exp  (20  aaXa  +  2  Xy6gV\a  ^abXaXb^  + 


{0a;  x)=  -  20aaXa  “  2  Y.h(^y\a  ^abXaXb  +  tT’  {^a) 
and  r{0)  =  t||0||i,  where  r  >  0  is  a  tuning  parameter;  this  loss  is  convex  in  0.  We  let  i(’{0)  =  5I 


(a)/ 


18 


CHAPTER  2.  SUMMARY  OF  THE  MOST  IMPORTANT  RESULTS 


and  use  a  dynamical  model  inspired  by  [49],  where 

{^iO)ab  = 

f  (1  9bc*  if  l^ac*  Gbc*  \  >  \0ah\ 

1  Oab  otherwise 

where  c*  =argminc  j^ac^fecl- 

The  intuition  is  that  if  two  members  of  the  network  share  a  strong  common  connection,  they 
will  become  connected  in  time.  We  set  ai  G  {0,  .001,  .002,  .003,  .004}  for  the  different  dynamical 
models.  We  set  r  =  .1  and  again  set  rj  using  the  doubling  trick  with  time  horizons  at  set  at 
increasing  powers  of  10.  As  in  [31],  we  find  that  regularizing  {e.g.,  thresholding)  every  10  steps, 
instead  of  at  each  time  step,  allows  for  the  values  to  grow  above  the  threshold  for  meaningful 
relationships  to  be  found. 

The  first  plot  in  Figure  2.3  shows  the  average  per  round  loss  of  each  model,  and  the  DFS  estima¬ 
tor  over  a  30  year  time  window.  We  see  that  applying  the  dynamical  model  dramatically  improves 
performance  relative  to  COMD  [at  =  0)  and  that  the  FS  estimator  aggregates  the  predictions 
successfully.  The  next  plot  shows  the  moving  average  losses  for  a  few  select  Senators,  where  high 
loss  corresponds  to  unpredictable  behavior.  Notice  that  John  Kerry  (D-MA)  has  generally  very 
low  loss,  but  spikes  around  2006,  but  then  drops  again  before  a  reelection  campaign  in  2008. 


Average  Per  Round  Loss 


Figure  2.3:  Tracking  a  dynamic  social  network.  Losses  for  different  dynamical  models  and  the  DFS 
predictions;  a  =  0  corresponds  to  COMD. 
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Loss  of  Select  Senators 


Figure  2.4;  Predictability  of  individual  senators.  Low  losses  correspond  to  predictable,  consistent 
voting  behavior,  while  higher  loss  means  less  predictable 


1859 


1877 


1967 


1887 


Figure  2.5:  Influence  matrices  for  select  years  spanning  Civil  War  and  Civil  Rights  Movement  to 
present.  We  see  tight  factions  forming  in  the  mid-  to  late- 1800s  (post  Civil  War),  followed  by  a 
time  when  the  faction  dissipate  in  the  mid-1900s  during  the  Civil  Rights  Movement  and  upheaval 
among  southern  Democrats.  (Best  viewed  in  color.) 


By  looking  at  the  network  estimates  of  the  DFS  estimator  across  time  (as  in  Figure  2.5)  we  can 
see  some  interesting  behavior  in  the  network  that  corresponds  to  historical  events,  as  described  in 
the  caption.  Finally,  we  see  factions  again  forming  in  more  recent  times.  The  seats  are  sorted  sep¬ 
arately  for  each  matrix  to  emphasize  groupings,  which  are  strongly  correlated  with  known  political 
factions. 


2.4.3  DFS  experiment:  compressive  video  reconstruction 

There  is  increasing  interest  in  using  “big  data”  analysis  techniques  in  applications  like  high- 
throughput  microscopy,  where  scientists  wish  to  image  large  collections  of  specimens.  This  work 
is  facilitated  by  the  development  of  novel  microscopes,  such  as  the  recent  fluorescence  microscope 
based  on  structured  illumination  and  compressed  sensing  principles  [51].  However,  measurements  in 
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such  systems  are  acquired  sequentially,  posing  significant  challenges  when  imaging  live  specimens. 

Knowledge  of  underlying  motion  in  compressed  sensing  image  sequences  can  allow  for  faster, 
more  accurate  reconstruction  [34,  41,  46].  By  accounting  for  the  underlying  motion  in  the  image 
sequence,  we  can  have  an  accurate  prediction  of  the  scene  before  receiving  compressed  measure¬ 
ments,  and  when  the  measurements  are  noisy  and  the  number  of  observations  is  far  less  than  the 
number  of  pixels  of  the  scene,  these  predictions  allow  both  fast  and  accurate  reconstructions.  If 
the  dynamics  are  not  accounted  for,  and  previous  observations  are  used  as  prior  knowledge,  the 
reconstruction  could  end  up  creating  artifacts  such  as  motion  blur  or  overfitting  to  noise.  There 
has  been  significant  recent  interest  in  using  models  of  temporal  structure  to  improve  time  series 
estimation  from  compressed  sensing  observations  [4,  54];  the  associated  algorithms,  however,  are 
typically  batch  methods  poorly  suited  to  large  quantities  of  streaming  data.  In  this  section  we 
demonstrate  that  DMD  helps  bridge  this  gap. 

In  this  section,  we  simulate  fluorescence  microscopy  data  generated  by  the  system  in  [51]  while 
imaging  a  paramecium  moving  in  a  2-dimensional  plane;  the  frame  is  denoted  6t  (a  120  x  120 
image  stored  as  a  length- 14400  vector)  which  takes  values  between  0  and  1.  The  corresponding 
observation  is  xt  =  AfOt  -|-  rit,  where  At  is  a  50  x  14400  matrix  with  each  element  drawn  iid  from 
M{0, 1)  and  nt  corresponds  to  measurement  noise  with  nt  A/’(0,cj^)  with  =  0.1.  This  model 
coincides  with  several  compressed  sensing  architectures  [21,  51]. 

Our  loss  function  uses  ft{0)  =  ~  where  r  >  0  is  a  tuning 

parameter.  We  construct  a  family  of  =  9  dynamical  models,  where  shifts  the  (unvec¬ 

torized)  frame,  9,  one  pixel  in  a  direction  corresponding  to  an  angle  of  2'ni/{N  —  1)  as  well  as  a 
“dynamic”  corresponding  to  no  motion.  (With  the  zero  motion  model,  DMD  reduces  to  COMID.) 
The  true  video  sequence  uses  different  dynamical  models  over  t  =  {1, . . .  ,550}  (upward  motion) 
and  t  =  {551, ...,  1000}  (motion  to  the  right).  Finally,  we  use  V'(')  =  5II  •  Hi  so  the  Bregman 
Divergence  D{x\\y)  =  ^[[a:  — ylli  is  the  usual  squared  Euclidean  distance.  The  DMD  sub-algorithms 
use  Vt  =  :^,T  =  .002  and  the  DFS  forecaster  uses  A  =  and  rjr  is  set  as  in  Theorem  4. 

The  experiment  was  then  run  100  times. 

Loss  vs  Time,  Single  Trial  Loss  vs  Time,  Avg  Over  100  Trials 


(a)  Losses  for  a  single  trial.  (b)  Losses  averaged  over  100  trials. 

Figure  2.6:  Tracking  dynamics  using  DFS  and  comparing  individual  models  for  directional  motion 
for  the  experiment  in  Section  2.4.3.  Only  shown  are  DMD  losses  for  motions  which  are  true  before 
or  after  t  =  550  for  clarity.  Before  t  =  550  the  upward  motion  dynamic  model  incurs  small  loss, 
where  as  after  t  =  550  the  motion  to  the  right  does  well,  and  DFS  successfully  tracks  this  change. 
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DMD  Prediction 


Ground  Truth  Correct  Dynamic 


COMID  Prediction  DPS  Prediction 


Figure  2.7:  Dynamic  compressed  sensing  experimental  results.  Zoomed  in  instantaneous  predictions 
at  t  =  1000  for  the  experiment  in  Section  2.4.3.  Top  Left:  0*.  Top  Right:  0Right,i-  Bottom  Left: 
^COMiD,t-  Bottom  Right:  Ot-  The  prediction  made  with  the  prevailing  motion  is  an  accurate 
representation  of  the  ground  truth,  while  the  prediction  with  the  wrong  dynamic  is  an  unclear 
picture.  The  DFS  algorithm  correctly  picks  out  the  cleaner  picture. 


Figures  2.6  and  2.7  show  the  impact  of  using  DFS.  We  see  that  DFS  switches  between  dynamical 
models  rapidly  and  outperforms  all  of  the  individual  predictions,  including  COMID,  used  as  a 
baseline,  to  show  the  advantages  of  incorporating  knowledge  of  the  dynamics. 


2.4.4  DMD  with  parametric  additive  dynamics:  social  network  tracking 

Finally,  we  look  at  self-exciting  point  processes  on  connected  networks  [10,  32].  Here  we  assume 
there  is  an  underlying  rate  for  nodes  in  a  network  which  dictate  how  likely  each  node  is  to  participate 
in  an  action.  Then,  based  on  which  nodes  act,  it  will  increase  other  nodes  likelihood  to  act  in  a 
dynamic  fashion.  For  instance,  in  the  context  of  social  networks,  we  may  observe  events  such  as 
people  meeting,  corresponding,  voting,  or  sharing  information  [42,  48,  50,  10,  59],  and  from  this 
data  wish  to  infer  who’s  actions  are  influencing  whose,  and  how  those  influences  evolve  over  time. 
In  a  biological  neural  network,  a  node  could  correspond  to  a  neuron  and  an  action  could  correspond 
to  a  neural  spike  [13]. 

We  simulate  observations  of  a  such  a  self-exciting  point  process  in  the  following  way: 

Ht+i  =  W)  =THt  +  Wxt  +  (1  -  t)/2 

xt  ~Poisson(^f) 

For  our  experiments  £  (0,5]^*^^  represents  the  average  number  of  actions  each  of  100  nodes 
will  make  during  time  interval  t,  and  W  G  [0,5]^*^^^^®*^  reflects  the  unknown  underlying  network 
structure  which  encodes  how  much  an  event  by  a  one  node  will  increase  the  likelihood  of  an  event 
by  another  node  in  future  time  intervals.  Here  we  assume  r  is  a  known  parameter  between  zero 
and  one,  fl  G  is  a  underlying  base  event  rate. 
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Moving  Avg  Loss,  Avg  over  1000  Trials 


0  0.5  1  1.5  2  2.5  3  3.5 

Time 


True  W 


Estimated  W 


(a)  Moving  average  loss  over  previous  100  time  (b)  The  true  value  and  final  estimate  of  W  computed  using 
points  for  DMD  with  a  known  W  matrix,  MD,  Alg  3. 
and  DMD  exp  averaged  over  1000  trials. 


Figure  2.8:  Experimental  results  tracking  a  self-exciting  point  process  on  a  network,  described  in 
Section  2.4.4.  Notice  how  the  loss  curve  for  Alg.  3  approaches  the  DMD  curve  (associated  with 
clairvoyant  knowledge  of  the  underlying  network  matrix  W)  as  the  estimate  of  W  improves,  and 
significantly  outperforms  conventional  mirror  descent. 


Our  goal  is  to  track  the  event  rates  [Xt  and  the  network  model  W  simultaneously;  Algorithm  3 
is  applied  with 

lt{0)  =(l,exp(6l))  -  {xt,6),  itip)  ={I,t)  -  {xt,log^), 

Z{0)  ={l,exp{9)),  /r  =VZ(0)  =  exp(0). 

We  generated  data  according  to  this  model  for  t  =  1,  ...,50000  for  1000  different  trials,  using 
T  =  0.5, /2  =  0.1  and  W  generated  such  that  it  is  all  zeros  except  on  each  distinct  10  x  10  block 
along  the  diagonal,  elements  are  chosen  to  be  uu^  for  a  vector  u  G  [0.1,1.!]^*^  with  elements 
chosen  uniformly  at  random.  The  matrix  W  is  then  normalized  so  that  its  spectral  norm  is  0.25 
for  stability.  Using  this  generated  data  we  ran  DMD  with  known  W  (Alg.  1),  MD,  and  DMD 
with  additive  dynamics  (Alg.  3)  to  learn  the  dynamic  rates.  The  step  size  parameters  were  set 
as  rji  =  .^/y/i  and  pt  =  .005/\/t.  The  results  are  shown  for  DMD  with  the  matrix  W  known  in 
advance,  MD  and  Alg.  3  in  Figure  2.8. 

We  again  see  several  important  characteristics  in  these  plots.  The  first  is  that  by  incorporating 
knowledge  of  the  dynamics,  we  incur  significantly  less  loss  than  standard  Mirror  Descent.  Secondly, 
we  see  that  even  without  knowing  what  the  values  of  the  matrix  W ,  we  can  learn  it  simultaneously 
with  the  rate  vectors  pt  from  streaming  data,  and  the  resulting  accurate  estimate  leads  to  low  loss 
in  the  estimates  of  the  rates. 


2.5  Conclusions  and  future  directions 

Processing  high-velocity  streams  of  high-dimensional  data  is  a  central  challenge  to  big  data  analysis. 
Scientists  and  engineers  continue  to  develop  sensors  capable  of  generating  large  quantities  of  data, 
but  often  only  a  small  fraction  of  that  data  is  carefully  examined  or  analyzed.  Fast  algorithms  for 
sifting  through  such  data  can  help  analysts  track  dynamic  environments  and  identify  important 
subsets  of  the  data  which  are  inconsistent  with  past  observations. 
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In  this  paper  we  have  proposed  a  novel  online  optimization  method,  called  Dynamic  Mirror 
Descent  (DMD),  which  incorporates  dynamical  models  into  the  prediction  process  and  yields  low 
regret  bounds  for  broad  classes  of  comparator  sequences.  The  proposed  methods  are  applicable 
for  a  wide  variety  of  observation  models,  noise  distributions,  and  dynamical  models.  There  is  no 
assumption  within  our  analysis  that  there  is  a  “true”  known  underlying  dynamical  model,  or  that 
the  best  dynamical  model  is  unchanging  with  time.  The  proposed  Dynamic  Fixed  Share  (DFS) 
algorithm  adaptively  selects  the  most  promising  dynamical  model  from  a  family  of  candidates  at 
each  time  step.  Additionally  we  show  methods  which  learn  in  parametric  families  of  dynamical 
models.  In  experiments  DMD  shows  strong  tracking  behavior  even  when  underlying  dynamical 
models  are  switching,  in  such  applications  as  dynamic  texture  analysis,  compressive  video,  and 
self-exciting  point  process  analysis. 
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